Genome doubling enabled the expansion of yeast vesicle traffic pathways

Vesicle budding and fusion in eukaryotes depend on a suite of protein types, such as Arfs, Rabs, coats and SNAREs. Distinct paralogs of these proteins act at distinct intracellular locations, suggesting a link between gene duplication and the expansion of vesicle traffic pathways. Genome doubling, a common source of paralogous genes in fungi, provides an ideal setting in which to explore this link. Here we trace the fates of paralog doublets derived from the 100-Ma-old hybridization event that gave rise to the whole genome duplication clade of budding yeast. We find that paralog doublets involved in specific vesicle traffic functions and pathways are convergently retained across the entire clade. Vesicle coats and adaptors involved in secretory and early-endocytic pathways are retained as doublets, at rates several-fold higher than expected by chance. Proteins involved in later endocytic steps and intra-Golgi traffic, including the entire set of multi-subunit and coiled-coil tethers, have reverted to singletons. These patterns demonstrate that selection has acted to expand and diversify the yeast vesicle traffic apparatus, across species and time.

. Convergent retention of paralog doublets after whole genome duplication. (A) The yeast WGD clade is descended from a hybridization between parents related to present-day members of the ZT and KLE clades, followed by a whole genome duplication event. See Fig. 4 for more details. Phylogenetic branch lengths are from Ref. 12 . (B) Top: Cladogram based on the phylogenetic tree from (A), indicating the left (L) and right (R) sub-clades. Bottom: Each paralog can be placed into one of four groups depending on whether it is present as a doublet in any member of the L or R sub-clades. We show the total number of paralogs in each of these groups, for all 4866 ancestral doublets (left) and the 360 ancestral doublets encoding vesicle traffic proteins (right). P-value calculations are described in the main text and "Methods". (C) Paralog doublets (double red dots) can revert to singletons (single gray dots) by loss of one gene copy (crossed circle). Bold red lines: lineages where presence of a doublet can be inferred, since a doublet is present in a descendant. Light pink lines: lineages where a doublet was present but this cannot be inferred, since the doublet is lost in all descendants. Top: A single contingent deletion prior to the divergence of the clade. Bottom: Multiple convergent deletions on different branches, which can occur at low or high loss rates for different genes. (D) We test whether the presence or absence of doublets in the R sub-clade is predictive of the number of doublet species in the L subclade, conditioned on the doublet being present at the L-R branch point. The histograms in these two cases are significantly different, supporting the convergent scenario in (C). www.nature.com/scientificreports/ This suggests that the genes in this set play important roles in single copy, but that having paralog doublets is not typically advantageous 20,21 .

Selection drives convergent retention of doublets across species. Paralog doublets may undergo
neo-functionalization (where one copy takes on a non-ancestral function) or sub-functionalization (where the ancestral function is split between the two copies). Once either of these events occurs, selection would favour doublet retention [22][23][24][25] . We can detect signals of such selection by comparing evolutionary trajectories across multiple species. Since we are interested in long-term evolutionary patterns, we compared doublet occurrence in the two most distantly-related sub-clades of the WGD clade, which we refer to as the L and R sub-clades (Fig. 1A,B; "Methods"). If a paralog is present as a doublet in any member of a sub-clade, we can infer it was present as a doublet at the root of that sub-clade (Fig. 1C, bold red lines). If a paralog is present as a singleton in every member of a sub-clade, we cannot draw any conclusions: one copy may have been deleted early, prior to the divergence of the sub-clade; or deleted later, independently in every member of the sub-clade (Fig. 1C, light pink lines). Genes in the ancestral doublet set can be separated into four groups based on how they are retained in present-day species (Fig. 1B): those present as doublets in both sub-clades (L+R), those present as doublets in only one or the other sub-clade (L, R) or those present as singletons everywhere. Within the full gene set, doublet presence or absence is strongly correlated between the sub-clades ( p = 2.4E−15 , Fisher's exact test; Fig. 1B, left). The observed doublet retention correlation could be contingent on history, the result of very early losses prior to the divergence of the sub-clades (Fig. 1C, top); or it could be convergent, the result of multiple later losses within each sub-clade (Fig. 1C, bottom). In a contingent scenario, the pattern arises purely due to shared ancestry, and is not connected to gene function. In a convergent scenario, the pattern arises because homologous genes have similar loss rates across different lineages. In this case, we expect a correlation in doublet loss between the sub-clades, conditioned on the doublet still being present in their last common ancestor. We can distinguish between contingent and convergent deletions as follows (Fig. 1D). We pick a sub-clade, and only look at paralogs present as doublets in at least one member of that sub-clade; this enforces the condition that paralogs must be present as doublets in the last common ancestor of both sub-clades. For each paralog, we count the number of doublet species in this sub-clade. Finally, we split these paralogs into two groups, depending on whether or not they are present as doublets in some member of the other sub-clade. If the doublet enrichment pattern is purely contingent, doublet species counts should be statistically indistinguishable between these two groups. Instead we find (Fig. 1D) that they are highly distinct ( p = 8.3E−16 for L sub-clade counts, p = 4.2E−6 for R sub-clade counts, Kolmogorov-Smirnov test). This rules out pure contingency, and implies convergent selection: some types of genes are more likely to be retained as doublets, and others more likely to revert to singletons, with losses occurring independently across species (Fig. 1C, bottom).
Vesicle traffic genes have the highest fold enrichment among doublets. Our analysis is consistent with previous work showing that doublet retention is correlated with function 19,[25][26][27] . To explore this connection in an unbiased manner, we examined the GO categories enriched among the 887 S. cerevisiae genes belonging to the L+R doublet set ("Methods"; Table 1; note that some L+R doublets are singletons in S. cerevisiae). Among the top GO categories ranked by statistical significance, 'endocytosis' has by far the highest fold enrichment (3.14× ), (with the super-category of 'vesicle mediated transport' also featuring on the list). Remarkably, this fold enrichment is even greater than that of the ribosomal genes typically presented as an extreme example of paralog retention 28 . The same result holds true across the entire WGD clade: even accounting for the genomewide correlation of paralog doublets across species, vesicle traffic genes are significantly over-represented among the L+R set (66/360 = 0.18 compared to 520/4866 = 0.11; p = 5.2E−6 , Fisher's exact test; Fig. 1B).
To further explore the role of function in doublet retention, we grouped genes into classes and modules. We define a module as a set of genes whose protein products act collectively to carry out specific vesicle traffic www.nature.com/scientificreports/ functions at specific cellular locations 1,7 . We manually assigned these genes to seven functional classes (Fig. 2, left) based on annotations from the Saccharomyces Genome Database 29 ("Methods"). We further sub-divided vesicle coats and adaptors into seven modules based on the traffic pathways where they are active 30 (Fig. 2 Doublets are retained in secretory and early-endocytic pathways. We next considered the impact of paralogous modules in the context of the global yeast vesicle traffic system (Fig. 3A). Traffic pathways can be Function-specific and pathway-specific modules. We group 204 ancestral vesicle traffic genes into seven functional classes. We further sub-divide the Coat/Adaptor class into seven pathway-specific modules. ER endoplasmic reticulum, PM plasma membrane, EE/TGN early endosome/trans-Golgi network, LE/PVE late endosome/pre-vacuolar endosome. Note that some gene products can act at multiple locations. Genes are labeled by the names of the corresponding S. cerevisiae homologs. We track whether each gene is present as a doublet (dark blue), a singleton (light blue), or has been completely lost (white) in each of the 12 species of the WGD clade. This information is represented as a matrix: rows correspond to genes, columns correspond to species. The L and R sub-clades are separated for visual clarity. The Coat/Adaptor portion of the matrix is shown expanded on the right. Paralogs present as doublets in both the L and R sub-clades are highlighted in red. Under each class or module description, we show the enrichment of L+R doublets compared to the background expectation. P-value calculations are described in the main text and "Methods"; * represents statistically significant enrichment or depletion.   In contrast to the above cases, several modules have completely reverted to singletons. These include: the COPI coat and all components of the AP adaptor complexes 30 ; the entire set of coiled-coil and multi-subunit tethers 35 ; and the ESCRT complex. With the exception of the exocyst complex and the AP-2 adaptor complex, which both act at the plasma membrane, the remaining singletons are all involved in retrograde Golgi traffic and late endocytic steps. The TRAPPI tether participates in ER-to-Golgi transport. The COPI and AP-1 coats, along with tethers GARP, COG, TRAPPII and DSL1, facilitate intra-Golgi cycling and Golgi-to-ER transport. The tethers CORVET and HOPS are involved in late endosomal and vacuolar dynamics, along with the ESCRT complex which remodels late endosomal membranes.

Retained doublets have lower evolutionary rates and fewer protein interactions. Cross inter-
actions between paralogous modules are common in newly formed yeast hybrids, even when parental species have diverged over 50 million years 36 . Tightly-interacting modules may be subject to dominant negative effects due to mutations in their paralogous partners, suggesting doublets involving highly interacting proteins are more likely to revert to singletons. However, it is also known that highly interacting proteins have lower evolutionary rates 37 , and in turn, lower evolutionary rates are correlated with lower rates of gene loss 38,39 . We sought to understand which of these two effects dominates.
As a proxy for the evolutionary rate of each ancestral doublet, we used the nucleotide sequence identity between the corresponding orthologs in present-day members of the ZT and KLE clades ("Methods"; this avoids the confounding effect of evolutionary rate variation between singletons and doublets in WGD clade species). We imputed a physical interaction network among the proteins encoded by ancestral doublets, using presentday interaction data for the corresponding proteins in S. cerevisiae 40 ("Methods"). We separated all 360 ancestral vesicle traffic doublets into two groups: those present as pure singletons across all present-day species, and those present as doublets in at least one present-day species. We then compared the distributions of evolutionary rates and protein interaction degrees between these two groups (Fig. 3B,C). We find that doublet retention is strongly associated with lower evolutionary rates (higher sequence identity; p = 3.3E−6 , Kolmogorov-Smirnov test; Fig. 3B). In contrast, doublet retention is only weakly associated with fewer protein interactions ( p = 2.6E−3 , Kolmogorov-Smirnov test; Fig. 3C). This is consistent with prior observations: cross-interactions after gene duplication are weakly correlated with fitness, due to compensatory mechanisms such as expression attenuation 21 ; in contrast, low evolutionary rates are strongly correlated with low gene loss because functionally important genes tend to be under purifying selection 39 . Taken together, these data reinforce our finding that doublet retention is driven by selection for specific function. Pre-WGD hybrids are typically sterile (red cross) but WGD restores fertility (green tick). Allelic pairs in the pre-WGD cell become paralog pairs in the post-WGD cell. The WGD step can occur in one of two ways 41 : deletion of the MAT locus (diagonal arrow), which converts the pre-WGD diploid to a genome-doubled allodiploid (effectively a mating-competent haploid) 42,43 ; or endo-reduplication (vertical arrow), which converts the pre-WGD diploid to a genome-doubled allotetraploid (effectively a fertile diploid) 44 . In the allodiploid, gene conversion can lead to loss of one parental variant of a paralog pair. In the allotetraploid, gene conversion occurs predominantly between the WGD-derived alleles rather than the hybridization-derived paralogs. (B) Typical organization of allotetraploid chromosomes soon after WGD. This schematic is based on the genome of the lager brewing yeast S. pastorianus, a 500-y-old interspecies hybrid 45,46 . All genes are present as allelic pairs, most of which are completely homozygous due to gene conversion or introgression. Homeologs are paralog pairs in which one variant is inherited from each parent. Ohnologs are paralog pairs in which one parental variant has been replaced by the other, by meiotic recombination in a pre-WGD hybrid diploid or by subsequent gene conversion. www.nature.com/scientificreports/

Discussion
Genome doubling is a recurring theme in eukaryotic evolution 20 . These events provide many opportunities for selection to act, and can reveal evolutionary pressures that are invisible under normal circumstances. In this study we have taken advantage of an ancient genome doubling event to rigorously demonstrate signatures of such selection on the yeast vesicle traffic system. Interspecies hybridization is a common route to genome doubling among fungi [47][48][49] . Interspecies diploids are typically sterile, since mismatches between homologous chromosomes stall meiosis; genome doubling spontaneously restores fertility in such hybrids, by creating an allotetraploid cell with two identical copies of each chromosome 41,43 (Fig. 4A). The alleles of the original hybrid diploid become paralog doublets of the allotetraploid. Paralogs are always at risk of being lost due to gene conversion, which occurs when homologous template sequences are used to repair double-strand breaks 50 . Newly-formed hybrid allotetraploid genomes typically contain both homeologs (pairs derived from both parents) and ohnologs (pairs tracing to a single parent, due to pre-WGD gene conversion) 51,52 (Fig 4B). Gene conversion rapidly erases variations between ohnologs 10,42,44 but spares the more diverged homeologs, since double-strand break repair in allotetraploids uses alleles as templates 45,46 (Fig. 4B). Homeologs thus have more opportunities for neo-functionalization or sub-functionalization, compared to ohnologs. Consistent with this, the majority of paralogs retained as doublets in present-day S. cerevisiae (63%), including most vesicle traffic doublets (66%), are homeologs ("Methods").
Paralogous modules have been retained in the yeast vesicle traffic system at rates much higher than expected by chance, 100 million years after they arose by hybridization. Even more striking is the clear convergence of evolutionary trajectories across diverse yeast species, indicating common selection pressures operating on the vesicle traffic system over time and across ecological contexts. It is likely that there are multiple mechanisms by which paralog doublets confer a selective advantage. For paralogs with highly overlapping functions, gene dosage may act to increase the capacity of vesicle traffic pathways 53 ). In all these ways, the presence of doublets potentially increases the versatility of the vesicle traffic system.
The last eukaryotic common ancestor (LECA) had distinct, homologous versions of Arfs, Rabs, coats, and SNAREs which operated along distinct trafficking pathways. These same protein families comprise organellespecific modules in all present-day eukaryotes. Yet the patterns of gene duplication subsequent to LECA appear to vary across lineages. The large family of tethers, essential for vesicle fusion, is comparable in size to other protein classes involved in vesicle traffic 35 . In yeast, tethers seem to have expanded by sporadic recruitment rather than by gene duplication 72 . This pattern finds echoes in our analysis: strikingly, all 47 tethers in our dataset have reverted to singletons in nearly every WGD species. However, this pattern does not hold outside of fungi: metazoans have two paralogs of the HOPS and CORVET components VPS16 and VPS33 73 , while ciliates have paralogous copies of multiple CORVET components 72 . More broadly, the pre-LECA and post-LECA phases of gene family expansion appear to be fundamentally different in character. Though we find that many paralogs derived from genome doubling have been retained within the yeast vesicle traffic system, there are nearly no examples of paralogs operating at highly distinct sub-cellular locations. This suggests that the architecture of vesicle traffic in present-day eukaryotes is tightly constrained, and that the genome doubling route we have explored is distinct from more ancient duplication routes. It is likely the major vesicle traffic gene families were generated during an earlier, more dynamic and less constrained phase of eukaryotic evolution.
Defining the ancestral paralog doublet set. We are interested in orthologs that were present as paralog doublets immediately following the original interspecies hybridization. By definition, one copy of each such gene is inherited from each parent. However, we do not know the true genetic complement of the parental species, only that of their closest living relatives. Operationally, we define the set of ancestral doublets as the set of 4866 genes found across the ZT, KLE and WGD clades. This definition has good specificity (3891 genes in this set are present in every member of the ZT and KLE clades, and were therefore likely to be inherited as doublets in the WGD ancestor) and good sensitivity (only 48 out of the 1374 genes present as doublets in any present-day WGD species are are missing from this set). 4075 out of 4866 genes are present in at least one copy in every present-day WGD species. The full list of ancestral doublets is provided in Supplementary Table S1. www.nature.com/scientificreports/ Classifying doublets as homeologs and ohnologs. The time of duplication of paralog doublets has been estimated using phylogenetic methods, as described in 13 . We obtained supporting data associated with this study. The duplication event is assigned to a branch of a species tree spanning the KLE, ZT and WGD clades, as defined in Fig. 1 of Ref. 13 . Each doublet is associated with two inferred duplication branches, based on the phylogenetic trees seeded by each doublet member. In the event that the two branches do not match, we retained the branch with higher support. Those with support below 0.95 were not considered. In this way, we could assign the duplication branch for 60% (377/620) of S. cerevisiae paralog doublets (Supplementary Table S1). Homeologs correspond to branches ≤ n4 (duplicated prior to WGD) and ohnologs correspond to branches ≥ n5 (duplicated at or after WGD Protein interaction analysis. The 360 ancestral vesicle traffic genes correspond to 420 S. cerevisiae genes (300 singletons and 60 doublets). For these genes we obtained the protein-protein interaction network from the STRING database 40 (string-db.org), filtering for the physical sub-network at medium confidence, with experiments and databases as interaction data sources. For genes present as doublets, we assumed an interaction between a pair of ancestral genes if there was an interaction between any of their paralogs, as would be expected based on a sub-functionalization scenario. Note that this systematically increases the inferred interaction degree of doublets. Even so, we find that doublets overall have fewer interactions than singletons (Fig. 3C). Interaction data are provided in Supplementary Table S1.
Evolutionary rate analysis. To estimate the evolutionary rate of ancestral paralog doublets, we examined the sequence identity between the corresponding orthologs in Z. rouxii and K. lactis (representative species for the ZT and KLE clade). We used the YGOB database (Version 7; ygob.ucd.ie) to assign orthologs. Of the 4866 ancestral doublets, 4585 had orthologs in Z. rouxii and K. lactis, of which 4490 sequences were protein-coding gene pairs. We aligned the corresponding gene pairs and computed the nucleotide sequence identity. The mean identity was 0.644 ± 0.003 for ancestral doublets with at least one present-day doublet, and 0.616 ± 0.001 for ancestral doublets with no present-day doublets (mean ± SEM). Sequence identity values are provided in Supplementary Table S1.

Data availability
All data generated or analysed during this study are included in this published article and its Supplementary Information files. www.nature.com/scientificreports/